Business Understanding

[10 points] Describe the purpose of the data set you selected (i.e., why was this data collected in the first place?). Describe how you would define and measure the outcomes from the dataset. That is, why is this data important and how do you know if you have mined useful knowledge from the dataset? How would you measure the effectiveness of a good prediction algorithm? Be specific.

Purpose of the data set: This crime dataset was published by the City of Los Angeles. It covers crime incidents from 2010 to 2019 and follows the FBI's Uniform Crime Reporting (UCR) standard, which summarizes incident information in a repeatable way that is comparable across cities, counties, and states. The LA City Mayor wrote that the data was released for transparency's sake, and to encourage those outside of government to use the data for innovation and problem solving. [Ref: https://data.lacity.org/]

Data Source: The data set used for the purposes of this project is sourced from LOS ANGELES OPEN DATA.

Data Importance: Analyzing crime data over several years helps identify high-risk areas and demographics so that corrective action can be taken to reduce crime and make the city safer.

Mining useful knowledge and measuring effectiveness of prediction algorithm:
For this data set we would like to predict the following features:

  • Feature A: Victim's Age:
    • Victim's Age is a continuous variable, which makes Feature A a regression problem.
    • We will use a 10-fold cross validation to measure the effectiveness of a good prediction algorithm.
    • Since Feature A is a regression problem, we will use metrics such as Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), or Mean Absolute Percentage Error (MAPE) to measure the success of the prediction.
  • Feature B: Victim's Age-Group:
    • Victim's Age-Group is a discrete/categorical variable, which makes Feature B a classification problem.
    • We will use a 10-fold cross validation to measure the effectiveness of a good prediction algorithm.
    • Since Feature B is a classification problem, we will use metrics such as Accuracy, Precision, Recall, F1 Score, the ROC curve, or the Area Under the Curve (AUC) to measure the success of the prediction.
  • Feature C: Victim's Sex:
    • Victim's Sex is a discrete/categorical variable, which makes Feature C a classification problem.
    • We will use a 10-fold cross validation to measure the effectiveness of a good prediction algorithm.
    • Since Feature C is a classification problem, we will use metrics such as Accuracy, Precision, Recall, F1 Score, the ROC curve, or the Area Under the Curve (AUC) to measure the success of the prediction.
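The evaluation plan above can be sketched with scikit-learn. The feature matrix `X` and the targets `y_reg` / `y_clf` below are synthetic placeholders for illustration only, not the actual LA Crimes features; the 10-fold cross validation and the metric names are the ones listed above.

```python
# Sketch of the planned 10-fold CV evaluation on toy data (placeholders,
# not the LA Crimes features).
import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y_reg = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)
y_clf = (y_reg > 0).astype(int)  # stand-in categorical target

cv = KFold(n_splits=10, shuffle=True, random_state=0)

# Feature A (regression): RMSE from the negated MSE scores
mse_scores = -cross_val_score(LinearRegression(), X, y_reg,
                              cv=cv, scoring="neg_mean_squared_error")
rmse = float(np.sqrt(mse_scores.mean()))

# Features B/C (classification): Accuracy and F1 across the same folds
acc = cross_val_score(LogisticRegression(), X, y_clf, cv=cv,
                      scoring="accuracy").mean()
f1 = cross_val_score(LogisticRegression(), X, y_clf, cv=cv,
                     scoring="f1").mean()
print(round(rmse, 3), round(acc, 3), round(f1, 3))
```

For the real dataset, only the estimators and the `X` / `y` inputs would change; the `cv` object and `scoring` strings stay the same.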

Import various modules

In [1]:
import pandas as pd
In [2]:
import numpy as np
In [3]:
import matplotlib.pyplot as plt
In [4]:
import descartes
In [5]:
import plotly.express as px
In [6]:
#!pip install geopandas
In [7]:
import geopandas as gpd
from shapely.geometry import Point, Polygon

%matplotlib inline

Data Meaning Type

[10 points] Describe the meaning and type of data (scale, values, etc.) for each attribute in the data file.

In [8]:
# Load the LA Crimes data set into pandas dataframe

df = pd.read_csv("Data/Crime_Data_from_2010_to_2019.csv")
In [9]:
# List the total number of rows and columns in the dataframe

print("Total number of rows in the dataframe: " + str(df.shape[0]))
print("Total number of columns in the dataframe: " + str(df.shape[1]))
Total number of rows in the dataframe: 2115333
Total number of columns in the dataframe: 28
In [10]:
# Displaying the data types of each column/attribute

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2115333 entries, 0 to 2115332
Data columns (total 28 columns):
 #   Column          Dtype  
---  ------          -----  
 0   DR_NO           int64  
 1   Date Rptd       object 
 2   DATE OCC        object 
 3   TIME OCC        int64  
 4   AREA            int64  
 5   AREA NAME       object 
 6   Rpt Dist No     int64  
 7   Part 1-2        int64  
 8   Crm Cd          int64  
 9   Crm Cd Desc     object 
 10  Mocodes         object 
 11  Vict Age        int64  
 12  Vict Sex        object 
 13  Vict Descent    object 
 14  Premis Cd       float64
 15  Premis Desc     object 
 16  Weapon Used Cd  float64
 17  Weapon Desc     object 
 18  Status          object 
 19  Status Desc     object 
 20  Crm Cd 1        float64
 21  Crm Cd 2        float64
 22  Crm Cd 3        float64
 23  Crm Cd 4        float64
 24  LOCATION        object 
 25  Cross Street    object 
 26  LAT             float64
 27  LON             float64
dtypes: float64(8), int64(7), object(13)
memory usage: 451.9+ MB

Upon committing to GitHub we discovered that the dataset exceeds GitHub's 100 MB file limit. In order to run this notebook from GitHub, we will use the API method provided by LA City. However, the JSON response does not resolve to the same data types as the CSV, so transformations will have to be made to coerce these fields to be identical to the CSV dataset we were able to import locally.
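The coercion described above might look like the following sketch. The `api_like` frame and its column values are stand-ins mimicking the all-`object` columns of the JSON response, not actual API output.

```python
# Stand-in for the API response, where every column arrives as strings
# (dtype 'object') rather than the numeric dtypes of the CSV import.
import pandas as pd

api_like = pd.DataFrame({
    "vict_age": ["25", "0", "41"],
    "lat": ["34.0522", "0.0000", "33.9425"],
    "lon": ["-118.2437", "0.0000", "-118.4081"],
})

# Coerce numeric-looking strings back to the CSV dtypes;
# errors='coerce' maps any malformed entry to NaN instead of raising.
api_like["vict_age"] = pd.to_numeric(api_like["vict_age"], errors="coerce").astype("int64")
api_like["lat"] = pd.to_numeric(api_like["lat"], errors="coerce")
api_like["lon"] = pd.to_numeric(api_like["lon"], errors="coerce")

print(api_like.dtypes)
```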

In [11]:
from sodapy import Socrata

# Unauthenticated client only works with public data sets. Note 'None'
# in place of application token, and no username or password:
client = Socrata("data.lacity.org", None)

# Example authenticated client (needed for non-public datasets):
# client = Socrata(data.lacity.org,
#                  MyAppToken,
#                  username="user@example.com",
#                  password="AFakePassword")

# First 2000 results, returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get("63jg-8b9z", limit=10000)



# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)
results_df.info()



results_df['lat'] = results_df['lat'].astype(float)
results_df['lon'] = results_df['lon'].astype(float)

WARNING:root:Requests made without an app_token will be subject to strict throttling limits.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 28 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   dr_no           10000 non-null  object
 1   date_rptd       10000 non-null  object
 2   date_occ        10000 non-null  object
 3   time_occ        10000 non-null  object
 4   area            10000 non-null  object
 5   area_name       10000 non-null  object
 6   rpt_dist_no     10000 non-null  object
 7   part_1_2        10000 non-null  object
 8   crm_cd          10000 non-null  object
 9   crm_cd_desc     10000 non-null  object
 10  mocodes         8794 non-null   object
 11  vict_age        10000 non-null  object
 12  vict_sex        9281 non-null   object
 13  vict_descent    9281 non-null   object
 14  premis_cd       9999 non-null   object
 15  premis_desc     9999 non-null   object
 16  status          10000 non-null  object
 17  status_desc     10000 non-null  object
 18  crm_cd_1        10000 non-null  object
 19  location        10000 non-null  object
 20  lat             10000 non-null  object
 21  lon             10000 non-null  object
 22  cross_street    2927 non-null   object
 23  weapon_used_cd  4273 non-null   object
 24  weapon_desc     4273 non-null   object
 25  crm_cd_2        772 non-null    object
 26  crm_cd_3        14 non-null     object
 27  crm_cd_4        1 non-null      object
dtypes: object(28)
memory usage: 2.1+ MB

Based on the outputs above we have the following observations for the LA Crimes data set:

  • Total number of rows in the dataframe: 2115333
  • Total number of columns in the dataframe: 28
  • The columns fall into 3 main type categories:
    • 8 columns with float64 values.
    • 7 columns with int64 values.
    • 13 columns with nominal object values.

A detailed description of each attribute along with its meaning and data type are displayed in a table below.
Description information in the table below is sourced from LOS ANGELES OPEN DATA.

In [12]:
# Load Data Description file into pandas dataframe
data_desc = pd.read_csv('Data/Data_Description.csv')

from IPython.display import display, HTML

display(HTML(data_desc.to_html()))
Column_Name Description Data_Type
0 DR_NO Division of Records Number: Official file number made up of a 2 digit year, area ID, and 5 digits Plain Text
1 Date Rptd Date Crime was Reported in format: MM/DD/YYYY Date & Time
2 DATE OCC Date Crime Occurred in format: MM/DD/YYYY Date & Time
3 TIME OCC Time Crime Occurred in 24 hour military time. Plain Text
4 AREA The LAPD has 21 Community Police Stations referred to as Geographic Areas within the department. These Geographic Areas are sequentially numbered from 1-21. Plain Text
5 AREA NAME The 21 Geographic Areas or Patrol Divisions are also given a name designation that references a landmark or the surrounding community that it is responsible for. For example 77th Street Division is located at the intersection of South Broadway and 77th Street, serving neighborhoods in South Los Angeles. Plain Text
6 Rpt Dist No A four-digit code that represents a sub-area within a Geographic Area. All crime records reference the "RD" that it occurred in for statistical comparisons. Plain Text
7 Part 1-2 - Number
8 Crm Cd Indicates the crime committed. (Same as Crime Code 1) Plain Text
9 Crm Cd Desc Defines the Crime Code provided. Plain Text
10 Mocodes Modus Operandi: Activities associated with the suspect in commission of the crime. See attached PDF for list of MO Codes in numerical order. Plain Text
11 Vict Age Age of the Victim: Two character numeric Plain Text
12 Vict Sex Sex of the Victim: F - Female | M - Male | X - Unknown Plain Text
13 Vict Descent Descent of the Victim (Code): A - Other Asian | B - Black | C - Chinese | D - Cambodian | F - Filipino | G - Guamanian | H - Hispanic/Latin/Mexican | I - American Indian/Alaskan Native | J - Japanese | K - Korean | L - Laotian | O - Other | P - Pacific Islander | S - Samoan | U - Hawaiian | V - Vietnamese | W - White | X - Unknown | Z - Asian Indian Plain Text
14 Premis Cd The type of structure, vehicle, or location where the crime took place. Plain Text
15 Premis Desc Defines the Premise Code provided. Plain Text
16 Weapon Used Cd The type of weapon used in the crime. Plain Text
17 Weapon Desc Defines the Weapon Used Code provided. Plain Text
18 Status Status of the case. (IC is the default) Plain Text
19 Status Desc Defines the Status Code provided. Plain Text
20 Crm Cd 1 Indicates the crime committed. Crime Code 1 is the primary and most serious one. Crime Code 2, 3, and 4 are respectively less serious offenses. Lower crime class numbers are more serious. Plain Text
21 Crm Cd 2 May contain a code for an additional crime, less serious than Crime Code 1. Plain Text
22 Crm Cd 3 May contain a code for an additional crime, less serious than Crime Code 1. Plain Text
23 Crm Cd 4 May contain a code for an additional crime, less serious than Crime Code 1. Plain Text
24 LOCATION Street address of crime incident rounded to the nearest hundred block to maintain anonymity. Plain Text
25 Cross Street Cross Street of rounded Address Plain Text
26 LAT Latitude Number
27 LON Longitude Number

Additional details about the data set can be found in the links below:

Link to Modus Operandi codes

Link to LAPD Reporting Districts

Data Quality

[15 points] Verify data quality: Explain any missing values, duplicate data, and outliers. Are those mistakes? How do you deal with these problems? Give justifications for your methods.

Exploring the Dataset

In [13]:
# Displaying the first 2 lines of the dataframe

df.head(2)
Out[13]:
DR_NO Date Rptd DATE OCC TIME OCC AREA AREA NAME Rpt Dist No Part 1-2 Crm Cd Crm Cd Desc ... Status Status Desc Crm Cd 1 Crm Cd 2 Crm Cd 3 Crm Cd 4 LOCATION Cross Street LAT LON
0 1307355 02/20/2010 12:00:00 AM 02/20/2010 12:00:00 AM 1350 13 Newton 1385 2 900 VIOLATION OF COURT ORDER ... AA Adult Arrest 900.0 NaN NaN NaN 300 E GAGE AV NaN 33.9825 -118.2695
1 11401303 09/13/2010 12:00:00 AM 09/12/2010 12:00:00 AM 45 14 Pacific 1485 2 740 VANDALISM - FELONY ($400 & OVER, ALL CHURCH VA... ... IC Invest Cont 740.0 NaN NaN NaN SEPULVEDA BL MANCHESTER AV 33.9599 -118.3962

2 rows × 28 columns
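Before the column-by-column exploration below, a one-line missingness summary is a quick way to see which attributes need attention. It is shown here on a small stand-in frame; in the notebook the same `df.isnull().sum()` call runs on the full dataset.

```python
# Per-column missing-value counts, sorted; the stand-in frame mimics a
# few of the columns explored below, not the real data.
import pandas as pd

sample = pd.DataFrame({
    "Vict Sex": ["M", None, "F", None],
    "Vict Descent": ["H", "W", None, "B"],
    "Vict Age": [25, 0, 41, 17],
})

missing = sample.isnull().sum().sort_values(ascending=False)
print(missing)
```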

In [14]:
# Summary of attributes in the dataframe

df.describe()
Out[14]:
DR_NO TIME OCC AREA Rpt Dist No Part 1-2 Crm Cd Vict Age Premis Cd Weapon Used Cd Crm Cd 1 Crm Cd 2 Crm Cd 3 Crm Cd 4 LAT LON
count 2.115333e+06 2.115333e+06 2.115333e+06 2.115333e+06 2.115333e+06 2.115333e+06 2.115333e+06 2.115280e+06 710470.000000 2.115323e+06 139402.000000 3499.000000 104.000000 2.115333e+06 2.115333e+06
mean 1.479209e+08 1.359865e+03 1.108682e+01 1.155070e+03 1.446029e+00 5.073349e+02 3.176409e+01 3.111814e+02 371.371934 5.071590e+02 949.996428 972.210346 975.105769 3.406377e+01 -1.183088e+02
std 2.887068e+07 6.470967e+02 6.012440e+00 6.012589e+02 4.970787e-01 2.106272e+02 2.064750e+01 2.113121e+02 113.469024 2.104900e+02 125.680743 85.516627 81.276338 7.115120e-01 2.440446e+00
min 1.208575e+06 1.000000e+00 1.000000e+00 1.000000e+02 1.000000e+00 1.100000e+02 -9.000000e+00 1.010000e+02 101.000000 1.100000e+02 210.000000 93.000000 421.000000 0.000000e+00 -1.188279e+02
25% 1.214242e+08 9.300000e+02 6.000000e+00 6.430000e+02 1.000000e+00 3.300000e+02 2.000000e+01 1.020000e+02 400.000000 3.300000e+02 998.000000 998.000000 998.000000 3.401050e+01 -1.184364e+02
50% 1.508087e+08 1.430000e+03 1.100000e+01 1.189000e+03 1.000000e+00 4.420000e+02 3.200000e+01 2.100000e+02 400.000000 4.420000e+02 998.000000 998.000000 998.000000 3.406230e+01 -1.183295e+02
75% 1.715119e+08 1.900000e+03 1.600000e+01 1.668000e+03 2.000000e+00 6.260000e+02 4.600000e+01 5.010000e+02 400.000000 6.260000e+02 998.000000 998.000000 998.000000 3.417580e+01 -1.182778e+02
max 9.102204e+08 2.359000e+03 2.100000e+01 2.199000e+03 2.000000e+00 9.560000e+02 1.180000e+02 9.710000e+02 516.000000 9.990000e+02 999.000000 999.000000 999.000000 3.479070e+01 0.000000e+00

Exploring Column: Vict Descent

In [15]:
# Display missing values in the Column "Vict Descent"

print("Missing values in the Column 'Vict Descent' are: " + str(df['Vict Descent'].isnull().sum()))
Missing values in the Column 'Vict Descent' are: 196726
In [16]:
# Display number of victims grouped by their descent
# Descent Code:
# A: Other Asian, B: Black, C: Chinese, D: Cambodian, F: Filipino, G: Guamanian, H: Hispanic/Latin/Mexican, 
# I: American Indian/Alaskan Native, J: Japanese, K: Korean, L: Laotian, O: Other, P: Pacific Islander, 
# S: Samoan, U: Hawaiian, V: Vietnamese, W: White, X: Unknown, Z: Asian Indian

print()
print("Number of victims grouped by their descent are as below:")
df.groupby('Vict Descent').size()
Number of victims grouped by their descent are as below:
Out[16]:
Vict Descent
-         3
A     51128
B    335200
C      1063
D        23
F      2558
G        85
H    725576
I       945
J       418
K      9151
L        18
O    203029
P       343
S        31
U       190
V       201
W    510333
X     78176
Z       136
dtype: int64

Based on the outputs above we have the following observations for the 'Vict Descent' column:

  • There are 3 records which do not have a valid Descent code. Those records are indicated by a hyphen '-'. Choosing Descent Code 'X' (Unknown) for these records would be appropriate.
  • There are 196726 records which are blank. These blank records are associated with Crime Code Descriptions such as "THEFT OF IDENTITY", "VEHICLE - STOLEN" etc. Hence, choosing Descent Code 'X' (Unknown) for these blank records would be appropriate.

Further inspection of a random selection of records suggests these are data entry errors. Hence, we can replace those records with Descent Code 'X', the Unknown category.

Imputation on Column: Vict Descent

In [17]:
# Replace records in the 'Vict Descent' column having '-' with 'X'

df['Vict Descent'] = df['Vict Descent'].replace(to_replace='-',value='X')
In [18]:
# Replace records in the 'Vict Descent' column having blanks with 'X'

df['Vict Descent'] = df['Vict Descent'].fillna('X')

Post-Imputation on Column: Vict Descent

In [19]:
# Display missing values in the Column "Vict Descent" and also display the number of victims grouped by their descent

print("Missing values in the Column 'Vict Descent' are: " + str(df['Vict Descent'].isnull().sum()))
print()
print("Number of victims grouped by their descent are as below:")
df.groupby('Vict Descent').size()
Missing values in the Column 'Vict Descent' are: 0

Number of victims grouped by their descent are as below:
Out[19]:
Vict Descent
A     51128
B    335200
C      1063
D        23
F      2558
G        85
H    725576
I       945
J       418
K      9151
L        18
O    203029
P       343
S        31
U       190
V       201
W    510333
X    274905
Z       136
dtype: int64

As seen in the output above, the Vict Descent category 'X' now has 274905 records.

Exploring Column: Vict Sex

In [20]:
# Display missing values in the Column "Vict Sex"

print("Missing values in the Column 'Vict Sex' are: " + str(df['Vict Sex'].isnull().sum()))
Missing values in the Column 'Vict Sex' are: 196680
In [21]:
# Display number of victims grouped by their Sex
# Sex Code:
# F: Female, M: Male, X: Unknown

print()
print("Number of victims grouped by their Sex are as below:")
df.groupby('Vict Sex').size()
Number of victims grouped by their Sex are as below:
Out[21]:
Vict Sex
-         1
F    888881
H        73
M    974525
N        17
X     55156
dtype: int64

Based on the outputs above we have the following observations for the 'Vict Sex' column:

  • There are 196680 records which are blank. These blank records are associated with Crime Code Descriptions such as "THEFT OF IDENTITY", "VEHICLE - STOLEN" etc. Hence, choosing Sex Code 'X' (Unknown) for these blank records would be appropriate.
  • There is 1 record which does not have a valid Sex code. This record is indicated by a hyphen '-'. Choosing Sex Code 'X' (Unknown) for the record would be appropriate.
  • There are 73 records with an invalid code 'H' and 17 records with an invalid code 'N'. Choosing Sex Code 'X' (Unknown) for these records would be appropriate.

Further inspection of a random selection of records suggests these are data entry errors. Hence, we can replace those records with Sex Code 'X', the Unknown category.

Imputation on Column: Vict Sex

In [22]:
# Replace records in the 'Vict Sex' column having '-' with 'X'

df['Vict Sex'] = df['Vict Sex'].replace(to_replace='-',value='X')
In [23]:
# Replace records in the 'Vict Sex' column having 'H' with 'X'

df['Vict Sex'] = df['Vict Sex'].replace(to_replace='H',value='X')
In [24]:
# Replace records in the 'Vict Sex' column having 'N' with 'X'

df['Vict Sex'] = df['Vict Sex'].replace(to_replace='N',value='X')
In [25]:
# Replace records in the 'Vict Sex' column having blanks with 'X'

df['Vict Sex'] = df['Vict Sex'].fillna('X')

Post-Imputation on Column: Vict Sex

In [26]:
# Display missing values in the Column "Vict Sex" and also display the number of victims grouped by their Sex

print("Missing values in the Column 'Vict Sex' are: " + str(df['Vict Sex'].isnull().sum()))
print()
print("Number of victims grouped by their Sex are as below:")
df.groupby('Vict Sex').size()
Missing values in the Column 'Vict Sex' are: 0

Number of victims grouped by their Sex are as below:
Out[26]:
Vict Sex
F    888881
M    974525
X    251927
dtype: int64

As seen in the output above, the Vict Sex category 'X' now has 251927 records.

Exploring Column: Vict Age

In [27]:
# Display missing values in the Column "Vict Age"

print("Missing values in the Column 'Vict Age' are: " + str(df['Vict Age'].isnull().sum()))
Missing values in the Column 'Vict Age' are: 0
In [28]:
# Display number of victims grouped by their Age

print()
print("Number of victims grouped by their Age are as below:")
df.groupby('Vict Age').size()
Number of victims grouped by their Age are as below:
Out[28]:
Vict Age
-9        6
-8        9
-7       15
-6       20
-5       27
       ... 
 97     168
 98     118
 99     809
 114      1
 118      1
Length: 110, dtype: int64
In [29]:
# Display number of victims grouped by their Age, where Age value is less than or equal to zero

print()
print("Number of victims grouped by their Age, where Age value is less than or equal to zero, are as below:")
df[df['Vict Age']<=0].groupby('Vict Age').size()
Number of victims grouped by their Age, where Age value is less than or equal to zero, are as below:
Out[29]:
Vict Age
-9         6
-8         9
-7        15
-6        20
-5        27
-4        35
-3        84
-2       129
-1       277
 0    369925
dtype: int64
In [30]:
# Display total number of records with invalid age values
# Age values zero and less are considered invalid age values

print()
print("Total number of records with invalid age values are: " + str( df[df['Vict Age']<=0]['Vict Age'].count() ))
Total number of records with invalid age values are: 370527
In [31]:
# Display number of victims grouped by their Age, where Age value is greater than or equal to 100

print()
print("Number of victims grouped by their Age, where Age value is greater than or equal to 100, are as below:")
df[df['Vict Age']>=100].groupby('Vict Age').size()
Number of victims grouped by their Age, where Age value is greater than or equal to 100, are as below:
Out[31]:
Vict Age
114    1
118    1
dtype: int64
In [32]:
# Display overall median victim age value

print()
print("Overall median victim age value in this data set is: " + str( df['Vict Age'].median() ))
Overall median victim age value in this data set is: 32.0
In [33]:
# Display median victim age value of children (>0 & <18)

print()
print("Median victim age value of children (>0 & <18) in this data set is: " + 
      str( df[(df['Vict Age']>0) & (df['Vict Age']<18)]['Vict Age'].median() ))
Median victim age value of children (>0 & <18) in this data set is: 14.0
In [34]:
# Display number of records grouped by 'Crime Code Description' categories that 
# contain the word "CHILD" and with invalid age value less than or equal to zero

df[ (df['Crm Cd Desc'].str.contains("CHILD")) & (df['Vict Age']<=0) ].groupby('Crm Cd Desc').size()
Out[34]:
Crm Cd Desc
CHILD ABANDONMENT                                30
CHILD ABUSE (PHYSICAL) - AGGRAVATED ASSAULT     330
CHILD ABUSE (PHYSICAL) - SIMPLE ASSAULT         383
CHILD ANNOYING (17YRS & UNDER)                   67
CHILD NEGLECT (SEE 300 W.I.C.)                 1102
CHILD PORNOGRAPHY                                16
CHILD STEALING                                   83
LEWD/LASCIVIOUS ACTS WITH CHILD                   5
dtype: int64
In [35]:
# Display number of records grouped by 'Vict Age' that 
# contain the word "CHILD" and with invalid age value less than or equal to zero

df[ (df['Crm Cd Desc'].str.contains("CHILD")) & (df['Vict Age']<=0) ].groupby('Vict Age').size()
Out[35]:
Vict Age
-3       2
-2       1
-1       5
 0    2008
dtype: int64
In [36]:
# Display total number of records that contain the word "CHILD" in the Crime Code Desc and 
# have an invalid age value less than or equal to zero

temp = df[ (df['Crm Cd Desc'].str.contains("CHILD")) & (df['Vict Age']<=0) ]['Vict Age'].count()
print("Total number of records that contain the word 'CHILD' in the Crime Code Desc and have an invalid age value less than or equal to zero are: " + str(temp))
Total number of records that contain the word 'CHILD' in the Crime Code Desc and have an invalid age value less than or equal to zero are: 2016

Based on the outputs above we have the following observations for the 'Vict Age' column:

  • There are zero records which are blank.
  • There are 2 records where the age value is greater than 100. These can be considered outliers. Since the number of such records is very small, we have concluded not to perform any imputations on them.
  • There are 370527 records with invalid age values, of which 2016 records contain the word 'CHILD' in the Crime Code Description.
  • Overall median victim age value in this data set is: 32.0.
  • Median victim age value of children (>0 & <18) in this data set is: 14.0.

Further inspection of a random selection of records suggests these invalid age values are data entry errors.

Hence, we have decided to perform imputations as below:

  • For the records that contain the word 'CHILD' in the Crime Code Desc and have an invalid age value less than or equal to zero, we would replace the invalid age with the Median victim age value for children which is 14.
  • For the remaining records that have an invalid age value less than or equal to zero, we would replace the invalid age with the Overall median victim age value which is 32.

Imputation on Column: Vict Age

In [37]:
# Replace records that contain the word 'CHILD' in the Crime Code Desc and
# have an invalid age value less than or equal to zero, with
# the Median victim age value for children which is 14.
# As seen in the outputs above, for this particular category the invalid age values are [0, -1, -2, -3]

temp = df[ (df['Crm Cd Desc'].str.contains("CHILD")) & (df['Vict Age']<=0) ]['Vict Age']
temp = temp.replace(to_replace=[0, -1, -2, -3],value=14)

df.loc[ ( (df['Crm Cd Desc'].str.contains("CHILD")) & (df['Vict Age']<=0) ), 'Vict Age'] = temp
In [38]:
# Now replace the remaining records that have an invalid age value less than or 
# equal to zero, with the Overall median victim age value which is 32.
# As seen in the outputs above, for this particular category the invalid age values
# are [0, -1, -2, -3, -4, -5, -6, -7, -8, -9]

df.loc[ df['Vict Age']<=0, 'Vict Age' ] = 32

Post-Imputation on Column: Vict Age

In [39]:
# Display total number of records with invalid age values
# Age values zero and less are considered invalid age values

print()
print("Total number of records with invalid age values are: " + str( df[df['Vict Age']<=0]['Vict Age'].count() ))
Total number of records with invalid age values are: 0
In [40]:
# Display number of victims grouped by their Age

print()
print("Number of victims grouped by their Age are as below:")
df.groupby('Vict Age').size()
Number of victims grouped by their Age are as below:
Out[40]:
Vict Age
2      1465
3      1799
4      2095
5      2375
6      2433
       ... 
97      168
98      118
99      809
114       1
118       1
Length: 100, dtype: int64

As seen in the output above, there are now NO records with invalid age values.

Exploring Columns: 'LAT' and 'LON'

In [41]:
# Display missing values in Columns 'LAT' and 'LON'

print()
print("Missing values in the Column 'LAT' are: " + str(df['LAT'].isnull().sum()))
print("Missing values in the Column 'LON' are: " + str(df['LON'].isnull().sum()))
Missing values in the Column 'LAT' are: 0
Missing values in the Column 'LON' are: 0
In [42]:
# Display records grouped by LAT values

print()
print("Records grouped by Latitude values:")
df.groupby('LAT').size()
Records grouped by Latitude values:
Out[42]:
LAT
0.0000     898
33.3427      6
33.7058      3
33.7060     38
33.7062     35
          ... 
34.6648      1
34.6765      1
34.6828      2
34.7060      2
34.7907      1
Length: 5421, dtype: int64
In [43]:
# Display records grouped by LON values

print()
print("Records grouped by Longitude values:")
df.groupby('LON').size()
Records grouped by Longitude values:
Out[43]:
LON
-118.8279      1
-118.8276      1
-118.7668      6
-118.6677      5
-118.6673     27
            ... 
-117.7115      1
-117.7100      1
-117.7059      1
-117.6596      1
 0.0000      898
Length: 5091, dtype: int64
In [44]:
# Display mean LAT and LON values

print()
mean_LAT = df['LAT'].mean()
mean_LON = df['LON'].mean()
print( "Mean latitude value in the current data set is: " + str(mean_LAT) )
print( "Mean longitude value in the current data set is: " + str(mean_LON) )
Mean latitude value in the current data set is: 34.06376940817359
Mean longitude value in the current data set is: -118.30882522231715

Based on the outputs above we have the following observations for the 'LAT' and 'LON' columns:

  • There are 898 records which have a latitude value of 0.0000.
  • There are 898 records which have a longitude value of 0.0000.

According to Google Maps, the latitude of Los Angeles, CA, USA is 34.052235 and the longitude is -118.243683, which indicates that LAT and LON values of 0.0000 are invalid.

Further inspection of a random selection of records suggests these invalid values are data entry errors.

Hence, we have decided to replace them with the respective 'mean' values.

Imputation on Columns: 'LAT' and 'LON'

In [45]:
# Replace invalid 'LAT' and 'LON' values with their mean values respectively

df.loc[ df['LAT']==0, 'LAT' ] = mean_LAT
df.loc[ df['LON']==0, 'LON' ] = mean_LON

Post-Imputation on Columns: 'LAT' and 'LON'

In [46]:
# Display min and max values of 'LAT' and 'LON' columns

print()
print( "Min LAT is: " + str(df['LAT'].min()) + " and " + "Max LAT is: " + str(df['LAT'].max()) )
print( "Min LON is: " + str(df['LON'].min()) + " and " + "Max LON is: " + str(df['LON'].max()) )
Min LAT is: 33.3427 and Max LAT is: 34.7907
Min LON is: -118.8279 and Max LON is: -117.6596

As seen in the output above, there are now NO records with invalid LAT and LON values.
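As an additional sanity check after this imputation, every coordinate should fall inside a loose bounding box around Los Angeles. The box limits below are rough assumptions chosen to bracket the min/max values printed above, not official boundaries, and the `coords` frame is a small stand-in.

```python
# Check that sample coordinates (stand-ins for df['LAT']/df['LON']) all
# fall inside an assumed bounding box around Los Angeles.
import pandas as pd

coords = pd.DataFrame({
    "LAT": [33.9825, 34.0522, 34.7907],
    "LON": [-118.2695, -118.2437, -118.8279],
})

in_box = coords["LAT"].between(33.0, 35.0) & coords["LON"].between(-119.0, -117.0)
print(bool(in_box.all()))
```

On the full dataset, `in_box.all()` returning True would confirm that no zero or otherwise out-of-range coordinates remain.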

Exploring Columns: 'AREA' and 'AREA NAME'

In [47]:
# Cosmetic correction
# When exploring the column name for AREA, a trailing space was found.
# Instead of having the column name as 'AREA' we have 'AREA '
# The code in this cell removes the trailing space

df = df.rename(columns = {"AREA ":"AREA"})
In [48]:
# Display records grouped by AREA and their corresponding AREA NAME

print()
print("Records grouped by 'AREA' and their corresponding 'AREA NAME':")
df.groupby(['AREA', 'AREA NAME']).size()
Records grouped by 'AREA' and their corresponding 'AREA NAME':
Out[48]:
AREA  AREA NAME  
1     Central         98289
2     Rampart         89576
3     Southwest      135437
4     Hollenbeck      77915
5     Harbor          92007
6     Hollywood       98989
7     Wilshire        88590
8     West LA         89047
9     Van Nuys        99704
10    West Valley     89562
11    Northeast      100252
12    77th Street    145272
13    Newton         100002
14    Pacific        112522
15    N Hollywood    113905
16    Foothill        79855
17    Devonshire      96478
18    Southeast      111457
19    Mission        103568
20    Olympic         95314
21    Topanga         97592
dtype: int64
In [49]:
# Identify missing values in 'AREA' and 'AREA NAME' columns

print()
print( "Missing values in column 'AREA' are: " + str(df['AREA'].isnull().sum()) )
print( "Missing values in column 'AREA NAME' are: " + str(df['AREA NAME'].isnull().sum()) )
Missing values in column 'AREA' are: 0
Missing values in column 'AREA NAME' are: 0

Based on the outputs above we have the following observations for 'AREA' and 'AREA NAME' columns:

  • There are NO missing values in these columns.
  • There are NO spelling mistakes or any other data entry errors in these columns.
  • Each AREA code is correctly mapped to its corresponding AREA NAME.

Hence no modifications would be made to these columns.

Exploring Columns: 'Status' and 'Status Desc'

In [50]:
# Identify missing values in 'Status' and 'Status Desc' columns

print()
print( "Missing values in column 'Status' are: " + str(df['Status'].isnull().sum()) )
print( "Missing values in column 'Status Desc' are: " + str(df['Status Desc'].isnull().sum()) )
Missing values in column 'Status' are: 3
Missing values in column 'Status Desc' are: 0
In [51]:
# Display records grouped by 'Status' and their corresponding 'Status Desc'

print()
print("Records grouped by 'Status' and their corresponding 'Status Desc':")
df.groupby(['Status', 'Status Desc']).size()
Records grouped by 'Status' and their corresponding 'Status Desc':
Out[51]:
Status  Status Desc 
13      UNK                   1
19      UNK                   1
AA      Adult Arrest     219426
AO      Adult Other      251366
CC      UNK                  29
IC      Invest Cont     1623298
JA      Juv Arrest        15867
JO      Juv Other          5341
TH      UNK                   1
dtype: int64

Based on the outputs above we have the following observations for the 'Status' and 'Status Desc' columns:

  • There is one record with Status = 13 and corresponding Description UNK.
  • There is one record with Status = 19 and corresponding Description UNK.
  • There is one record with Status = TH and corresponding Description UNK.
  • There are 29 records with Status = CC and corresponding Description UNK.
  • There are 3 blanks in Status column and corresponding Description UNK.

Based on further analysis, by looking at the actual records, we found the following:

  • Status Desc 'UNK' meant UNKNOWN.
  • The records with status codes 13 and 19 had area codes 13 and 19 respectively.
  • Status codes 'TH' and 'CC' did not give a clear indication as to what they meant. However, the Status Desc 'UNK' indicated that these codes were invalid.

Hence, we concluded that these invalid values were likely data entry errors, and we will replace them with the Status code 'IC' and Status Desc 'Invest Cont', which are the default values for these fields.

Imputation on Columns: 'Status' and 'Status Desc'

In [52]:
# Replace 'UNK' in 'Status Desc' column with 'Invest Cont'

df.loc[ df['Status Desc']=='UNK', 'Status Desc' ] = 'Invest Cont'
In [53]:
# Replace records with status codes '13', '19', 'TH' and 'CC' with 'IC'

df.loc[ df['Status'].isin(['13', '19', 'TH', 'CC']), 'Status' ] = 'IC'
In [54]:
# Replace blanks in 'Status' column with 'IC'

df['Status'] = df['Status'].fillna('IC')

Post-Imputation on Columns: 'Status' and 'Status Desc'

In [55]:
# Identify missing values in 'Status' and 'Status Desc' columns

print()
print( "Missing values in column 'Status' are: " + str(df['Status'].isnull().sum()) )
print( "Missing values in column 'Status Desc' are: " + str(df['Status Desc'].isnull().sum()) )
Missing values in column 'Status' are: 0
Missing values in column 'Status Desc' are: 0
In [56]:
# Display records grouped by 'Status' and their corresponding 'Status Desc'

print()
print("Records grouped by 'Status' and their corresponding 'Status Desc':")
df.groupby(['Status', 'Status Desc']).size()
Records grouped by 'Status' and their corresponding 'Status Desc':
Out[56]:
Status  Status Desc 
AA      Adult Arrest     219426
AO      Adult Other      251366
IC      Invest Cont     1623333
JA      Juv Arrest        15867
JO      Juv Other          5341
dtype: int64

As seen in the output above, there are now NO records with invalid 'Status' and 'Status Desc' values.

Exploring Columns: 'Crm Cd' and 'Crm Cd Desc'

In [57]:
# Identify missing values in 'Crm Cd' and 'Crm Cd Desc' columns

print()
print( "Missing values in column 'Crm Cd' are: " + str(df['Crm Cd'].isnull().sum()) )
print( "Missing values in column 'Crm Cd Desc' are: " + str(df['Crm Cd Desc'].isnull().sum()) )
Missing values in column 'Crm Cd' are: 0
Missing values in column 'Crm Cd Desc' are: 0
In [58]:
# Display records grouped by 'Crm Cd' and their corresponding 'Crm Cd Desc'

print()
print("Records grouped by 'Crm Cd' and their corresponding 'Crm Cd Desc':")
df.groupby(['Crm Cd', 'Crm Cd Desc']).size()
Records grouped by 'Crm Cd' and their corresponding 'Crm Cd Desc':
Out[58]:
Crm Cd  Crm Cd Desc                                         
110     CRIMINAL HOMICIDE                                        2773
113     MANSLAUGHTER, NEGLIGENT                                     5
121     RAPE, FORCIBLE                                          10327
122     RAPE, ATTEMPTED                                          1109
210     ROBBERY                                                 83854
                                                                ...  
950     DEFRAUDING INNKEEPER/THEFT OF SERVICES, OVER $400         238
951     DEFRAUDING INNKEEPER/THEFT OF SERVICES, $400 & UNDER     2158
952     ABORTION/ILLEGAL                                            7
954     CONTRIBUTING                                              183
956     LETTERS, LEWD  -  TELEPHONE CALLS, LEWD                 21209
Length: 142, dtype: int64

As seen above, there are NO errors in the 'Crm Cd' and 'Crm Cd Desc' columns.

Exploring Columns: 'Weapon Used Cd' and 'Weapon Desc'

In [59]:
# Identify missing values in 'Weapon Used Cd' and 'Weapon Desc' columns

print()
print( "Missing values in column 'Weapon Used Cd' are: " + str(df['Weapon Used Cd'].isnull().sum()) )
print( "Missing values in column 'Weapon Desc' are: " + str(df['Weapon Desc'].isnull().sum()) )
Missing values in column 'Weapon Used Cd' are: 1404863
Missing values in column 'Weapon Desc' are: 1404864
In [60]:
# As seen in the output above, one record has a 'Weapon Used Cd' but no 'Weapon Desc'
# Display that record

df[ (df['Weapon Desc'].isnull()) & (df['Weapon Used Cd'].notnull()) ]
Out[60]:
DR_NO Date Rptd DATE OCC TIME OCC AREA AREA NAME Rpt Dist No Part 1-2 Crm Cd Crm Cd Desc ... Status Status Desc Crm Cd 1 Crm Cd 2 Crm Cd 3 Crm Cd 4 LOCATION Cross Street LAT LON
196263 102114892 07/26/2010 12:00:00 AM 07/25/2010 12:00:00 AM 335 21 Topanga 2136 1 230 ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT ... IC Invest Cont 230.0 NaN NaN NaN 7300 VARIEL AV NaN 34.2029 -118.5929

1 rows × 28 columns

In [61]:
# Display records grouped by 'Weapon Used Cd' and their corresponding 'Weapon Desc'

print()
print("Records grouped by 'Weapon Used Cd' and their corresponding 'Weapon Desc':")
df.groupby(['Weapon Used Cd', 'Weapon Desc']).size()
Records grouped by 'Weapon Used Cd' and their corresponding 'Weapon Desc':
Out[61]:
Weapon Used Cd  Weapon Desc               
101.0           REVOLVER                       5565
102.0           HAND GUN                      34114
103.0           RIFLE                           867
104.0           SHOTGUN                        1203
105.0           SAWED OFF RIFLE/SHOTGUN         154
                                              ...  
512.0           MACE/PEPPER SPRAY              4034
513.0           STUN GUN                        636
514.0           TIRE IRON                       377
515.0           PHYSICAL PRESENCE              1784
516.0           DOG/ANIMAL (SIC ANIMAL ON)       40
Length: 79, dtype: int64

Based on the outputs above we have the following observations for 'Weapon Used Cd' and 'Weapon Desc' columns:

  • There are 1404863 records which do not have a 'Weapon Used Cd'.
  • There are 1404864 records which do not have a 'Weapon Desc', meaning these crimes did not involve any weapon.

One record has a 'Weapon Used Cd' but no 'Weapon Desc'. Looking closely at that record, its crime description shows the crime was committed with a 'DEADLY WEAPON', so we conclude the weapon description was simply unavailable. Since this is just one record, we will leave it as is.
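If one later wanted to fill such a gap instead of leaving it, a code-to-description lookup built from the complete rows would do it. A minimal sketch on toy data (the codes and descriptions below are hypothetical stand-ins):

```python
import pandas as pd

# Toy stand-in for 'Weapon Used Cd' / 'Weapon Desc' (values hypothetical).
df = pd.DataFrame({
    'Weapon Used Cd': [102.0, 102.0, 200.0, 200.0],
    'Weapon Desc':    ['HAND GUN', 'HAND GUN', 'KNIFE', None],
})

# Build a lookup from rows where both fields are present...
lookup = (df.dropna(subset=['Weapon Desc'])
            .drop_duplicates('Weapon Used Cd')
            .set_index('Weapon Used Cd')['Weapon Desc'])

# ...and use it to fill descriptions that are missing but have a code.
df['Weapon Desc'] = df['Weapon Desc'].fillna(df['Weapon Used Cd'].map(lookup))
print(df['Weapon Desc'].tolist())  # ['HAND GUN', 'HAND GUN', 'KNIFE', 'KNIFE']
```

This scales to any number of affected rows, so it is worth keeping in mind even though a single record does not justify it here.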

Exploring Columns: 'Premis Cd' and 'Premis Desc'

In [62]:
# Identify missing values in 'Premis Cd' and 'Premis Desc' columns

print()
print( "Missing values in column 'Premis Cd' are: " + str(df['Premis Cd'].isnull().sum()) )
print( "Missing values in column 'Premis Desc' are: " + str(df['Premis Desc'].isnull().sum()) )
Missing values in column 'Premis Cd' are: 53
Missing values in column 'Premis Desc' are: 187
In [63]:
# Display records grouped by 'Premis Cd' and their corresponding 'Premis Desc'

print()
print("Records grouped by 'Premis Cd' and their corresponding 'Premis Desc':")
df.groupby(['Premis Cd', 'Premis Desc']).size()
Records grouped by 'Premis Cd' and their corresponding 'Premis Desc':
Out[63]:
Premis Cd  Premis Desc                       
101.0      STREET                                472827
102.0      SIDEWALK                              105821
103.0      ALLEY                                  13763
104.0      DRIVEWAY                               42544
105.0      PEDESTRIAN OVERCROSSING                  258
                                                  ...  
967.0      MTA - GOLD LINE - CHINATOWN               16
968.0      MTA - GOLD LINE - LINCOLN/CYPRESS         17
969.0      MTA - GOLD LINE - HERITAGE SQ             10
970.0      MTA - GOLD LINE - SOUTHWEST MUSEUM        15
971.0      MTA - GOLD LINE - HIGHLAND PARK           25
Length: 321, dtype: int64

Imputation on the 'Premis Cd' and 'Premis Desc' columns is NOT yet complete and could take some time.
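One simple option for when this imputation is finished, sketched on toy data (the premise codes below are hypothetical stand-ins), is to fill the few missing values with the most frequent (mode) premise:

```python
import pandas as pd

# Toy stand-in for 'Premis Cd' / 'Premis Desc' (values hypothetical).
df = pd.DataFrame({
    'Premis Cd':   [101.0, 101.0, 102.0, None],
    'Premis Desc': ['STREET', 'STREET', 'SIDEWALK', None],
})

# Mode imputation: replace missing entries with the most common value in the column.
for col in ['Premis Cd', 'Premis Desc']:
    df[col] = df[col].fillna(df[col].mode()[0])

print(int(df.isnull().sum().sum()))  # 0: no missing values remain
```

With only 53 and 187 missing values out of over two million rows, mode imputation would have a negligible effect on the column distributions.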

Exploring Columns: 'DR_NO', 'Date Rptd', 'DATE OCC', 'TIME OCC', 'Rpt Dist No', 'Part 1-2'

In [64]:
# Identify missing values in 'DR_NO', 'Date Rptd', 'DATE OCC', 'TIME OCC', 'Rpt Dist No', 'Part 1-2' columns

print()
print( "Missing values in column 'DR_NO' are: " + str(df['DR_NO'].isnull().sum()) )
print( "Missing values in column 'Date Rptd' are: " + str(df['Date Rptd'].isnull().sum()) )
print( "Missing values in column 'DATE OCC' are: " + str(df['DATE OCC'].isnull().sum()) )
print( "Missing values in column 'TIME OCC' are: " + str(df['TIME OCC'].isnull().sum()) )
print( "Missing values in column 'Rpt Dist No' are: " + str(df['Rpt Dist No'].isnull().sum()) )
print( "Missing values in column 'Part 1-2' are: " + str(df['Part 1-2'].isnull().sum()) )
Missing values in column 'DR_NO' are: 0
Missing values in column 'Date Rptd' are: 0
Missing values in column 'DATE OCC' are: 0
Missing values in column 'TIME OCC' are: 0
Missing values in column 'Rpt Dist No' are: 0
Missing values in column 'Part 1-2' are: 0
In [65]:
# Display min and max values from 'TIME OCC' column to verify whether the time values have any errors

print()
print("Minimum value in 'TIME OCC' column: " + str(df['TIME OCC'].min()) )
print("Maximum value in 'TIME OCC' column: " + str(df['TIME OCC'].max()) )
Minimum value in 'TIME OCC' column: 1
Maximum value in 'TIME OCC' column: 2359

As seen above, there are NO errors in the 'TIME OCC' column. The time values fall within the range 0000 to 2359 (military time).
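Because 'TIME OCC' stores military time as an integer HHMM, the hour and minute components can be split with integer arithmetic and range-checked. A minimal sketch (toy values chosen to cover the observed min and max):

```python
import pandas as pd

# 'TIME OCC' stores military time as an integer HHMM, e.g. 1 -> 00:01, 2359 -> 23:59.
times = pd.Series([1, 930, 1430, 2359], name='TIME OCC')

hours = times // 100   # integer division yields the hour component
minutes = times % 100  # the remainder yields the minutes
valid = hours.between(0, 23) & minutes.between(0, 59)

print(hours.tolist(), bool(valid.all()))  # [0, 9, 14, 23] True
```

The same `valid` mask applied to the full column would catch any malformed value such as 2475 that a simple min/max check could miss.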

Find Duplicate rows and columns in the data set

In [66]:
# Identify duplicate rows in the entire data set, using information from all the columns
# The below code marks duplicates as 'True' except for the first occurrence.
# df.duplicated(subset=None, keep='first')

print()
print("Grouping non-duplicates into False bucket and duplicates into True bucket:")
df.groupby( [df.duplicated(subset=None, keep='first')] ).size()
Grouping non-duplicates into False bucket and duplicates into True bucket:
Out[66]:
False    2115333
dtype: int64

As seen in the output above, there are NO duplicate rows in the current data set.

From the outputs of "df.info()" and "display(HTML(data_desc.to_html()))" given above, we conclude that there are NO duplicate columns in the current data set.

However, it is important to note that the information in columns "Crm Cd" and "Crm Cd 1" is basically the same. So when building models we would make use of only one of these columns.
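The claim that "Crm Cd" and "Crm Cd 1" carry the same information can be quantified as a row-wise agreement rate. A minimal sketch on toy data (the codes below are hypothetical stand-ins):

```python
import pandas as pd

# Toy stand-in for 'Crm Cd' vs 'Crm Cd 1' (values hypothetical).
df = pd.DataFrame({
    'Crm Cd':   [110, 121, 210, 330],
    'Crm Cd 1': [110.0, 121.0, 210.0, 331.0],
})

# Fraction of rows where the two columns agree; a value near 1.0 means
# the columns are redundant and only one is needed for modeling.
agreement = (df['Crm Cd'] == df['Crm Cd 1']).mean()
print(agreement)  # 0.75 on this toy data
```

Running this on the full dataset would put a number on "basically the same" before dropping one of the columns.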

Find Outliers in the data set

When analyzing each individual column in the data set, the only outliers we encountered were two individuals with ages 114 and 118. The same is graphically represented in the box plot below.

Please refer to section "Exploring Column: Vict Age" for more details.
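The box plot uses the standard 1.5 × IQR rule to flag outliers; the same rule can be applied directly to return the offending rows. A minimal sketch on toy ages (hypothetical values that include the two extremes noted above):

```python
import pandas as pd

# Toy ages including the two extreme values noted above (114 and 118).
ages = pd.Series([22, 25, 28, 30, 32, 35, 37, 40, 46, 114, 118], name='Vict Age')

# Box-plot rule: flag points beyond 1.5 * IQR outside the quartiles.
q1, q3 = ages.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = ages[(ages < q1 - 1.5 * iqr) | (ages > q3 + 1.5 * iqr)]
print(outliers.tolist())  # [114, 118]
```

Extracting the outliers as data (rather than just drawing them) makes it easy to inspect the full records behind the extreme ages.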

In [67]:
# Box plot to display outliers in the 'Vict Age' column.

df.boxplot(column='Vict Age', return_type='axes');
In [68]:
# Find Missing values in each column

df.isnull().sum()
Out[68]:
DR_NO                   0
Date Rptd               0
DATE OCC                0
TIME OCC                0
AREA                    0
AREA NAME               0
Rpt Dist No             0
Part 1-2                0
Crm Cd                  0
Crm Cd Desc             0
Mocodes            228017
Vict Age                0
Vict Sex                0
Vict Descent            0
Premis Cd              53
Premis Desc           187
Weapon Used Cd    1404863
Weapon Desc       1404864
Status                  0
Status Desc             0
Crm Cd 1               10
Crm Cd 2          1975931
Crm Cd 3          2111834
Crm Cd 4          2115229
LOCATION                0
Cross Street      1759943
LAT                     0
LON                     0
dtype: int64

Simple Statistics

[10 points] Visualize appropriate statistics (e.g., range, mode, mean, median, variance, counts) for a subset of attributes. Describe anything meaningful you found from this or if you found something potentially interesting. Note: You can also use data from other sources for comparison. Explain why the statistics run are meaningful.

In [69]:
# Summary of attributes in the dataframe

df.describe()
Out[69]:
DR_NO TIME OCC AREA Rpt Dist No Part 1-2 Crm Cd Vict Age Premis Cd Weapon Used Cd Crm Cd 1 Crm Cd 2 Crm Cd 3 Crm Cd 4 LAT LON
count 2.115333e+06 2.115333e+06 2.115333e+06 2.115333e+06 2.115333e+06 2.115333e+06 2.115333e+06 2.115280e+06 710470.000000 2.115323e+06 139402.000000 3499.000000 104.000000 2.115333e+06 2.115333e+06
mean 1.479209e+08 1.359865e+03 1.108682e+01 1.155070e+03 1.446029e+00 5.073349e+02 3.735281e+01 3.111814e+02 371.371934 5.071590e+02 949.996428 972.210346 975.105769 3.407823e+01 -1.183590e+02
std 2.887068e+07 6.470967e+02 6.012440e+00 6.012589e+02 4.970787e-01 2.106272e+02 1.478579e+01 2.113121e+02 113.469024 2.104900e+02 125.680743 85.516627 81.276338 1.159865e-01 1.061292e-01
min 1.208575e+06 1.000000e+00 1.000000e+00 1.000000e+02 1.000000e+00 1.100000e+02 2.000000e+00 1.010000e+02 101.000000 1.100000e+02 210.000000 93.000000 421.000000 3.334270e+01 -1.188279e+02
25% 1.214242e+08 9.300000e+02 6.000000e+00 6.430000e+02 1.000000e+00 3.300000e+02 2.800000e+01 1.020000e+02 400.000000 3.300000e+02 998.000000 998.000000 998.000000 3.401060e+01 -1.184364e+02
50% 1.508087e+08 1.430000e+03 1.100000e+01 1.189000e+03 1.000000e+00 4.420000e+02 3.200000e+01 2.100000e+02 400.000000 4.420000e+02 998.000000 998.000000 998.000000 3.406240e+01 -1.183295e+02
75% 1.715119e+08 1.900000e+03 1.600000e+01 1.668000e+03 2.000000e+00 6.260000e+02 4.600000e+01 5.010000e+02 400.000000 6.260000e+02 998.000000 998.000000 998.000000 3.417580e+01 -1.182779e+02
max 9.102204e+08 2.359000e+03 2.100000e+01 2.199000e+03 2.000000e+00 9.560000e+02 1.180000e+02 9.710000e+02 516.000000 9.990000e+02 999.000000 999.000000 999.000000 3.479070e+01 -1.176596e+02

Victim age appears to peak in the thirties and decline from there. This is consistent with the mean victim age of about 37 computed below.

In [70]:
df['Vict Age'].mean()
Out[70]:
37.35280733577172
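Because the age distribution is right-skewed (see the histogram later in this section), the mean overstates the typical victim age; comparing it with the median and mode makes this concrete. A small sketch on toy ages (hypothetical values, chosen to be right-skewed like the real column):

```python
import pandas as pd

# A small right-skewed toy sample: for right-skewed data, mode < median < mean.
ages = pd.Series([25, 28, 30, 30, 32, 35, 40, 55, 70])

print(ages.mode()[0], ages.median(), round(ages.mean(), 1))  # 30 32.0 38.3
```

Reporting all three statistics for 'Vict Age' would show the same ordering and give a more robust sense of the "typical" victim age than the mean alone.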
In [71]:
df.hist(column='TIME OCC')
Out[71]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x11d2946a0>]],
      dtype=object)

Visualize Attributes

[15 points] Visualize the most interesting attributes (at least 5 attributes, your opinion on what is interesting). Important: Interpret the implications for each visualization. Explain for each attribute why the chosen visualization is appropriate.

The geographic visualizations later in this section follow a Geopandas tutorial: https://towardsdatascience.com/geopandas-101-plot-any-data-with-a-latitude-and-longitude-on-a-map-98e01944b972

In [72]:
# Extracting Year of Crime as an attribute

df['year'] = pd.DatetimeIndex(df['DATE OCC']).year
In [73]:
# Extracting Month of Crime as an attribute

df['month'] = pd.DatetimeIndex(df['DATE OCC']).month_name()
In [74]:
plt.style.use('ggplot')

df_area = df.groupby(by=['AREA NAME'])
area_crime_count = df_area['AREA NAME'].count()
area_crime_count.sort_values().plot.barh(title= 'Crime Reported Over Last 10 Years')
Out[74]:
<matplotlib.axes._subplots.AxesSubplot at 0x1295db278>
In [75]:
df['DayHr'] = pd.cut(df['TIME OCC'],[0,59,159,259,359,459,559,659,759,859,959,1059,1159,1259,1359,1459,1559,1659,1759,1859,1959,2059,2159,2259,2359],labels=['1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','21','22','23','24'])
df_dayhr_tmp = df.groupby('DayHr').count().reset_index().iloc[:,[0,1]]
df_dayhr = df_dayhr_tmp.rename(columns={df_dayhr_tmp.columns[0]: "HrOfDay", df_dayhr_tmp.columns[1]: "Count"})
#ax1 = plt.pie(df_daytime.Count,labels=df_daytime.TimeOfDay,autopct='%1.1f%%')
#plt.title('Crime Occurance by Time of Day')
df_dayhr
plt.bar(df_dayhr.HrOfDay,df_dayhr.Count)
plt.title('Crime Count by the hour of Day')
plt.xlabel('Hour of Day')
plt.ylabel('Total number of crimes reported')
Out[75]:
Text(0, 0.5, 'Total number of crimes reported')
In [76]:
# Bin 'TIME OCC' into six 4-hour windows and chart the share of crime reports in each

df['DayType'] = pd.cut(df['TIME OCC'],[0,359,759,1159,1559,1959,2359],labels=['00hrs - 04hrs','04hrs - 08hrs','08hrs - 12hrs','12hrs - 16hrs','16hrs - 20hrs','20hrs - 00hrs'])
df_daytime_tmp = df.groupby('DayType').count().reset_index().iloc[:,[0,1]]
df_daytime = df_daytime_tmp.rename(columns={df_daytime_tmp.columns[0]: "TimeOfDay", df_daytime_tmp.columns[1]: "Count"})
ax1 = plt.pie(df_daytime.Count,labels=df_daytime.TimeOfDay,autopct='%1.1f%%')
plt.title('% Crime Reports in 4hrs window')
Out[76]:
Text(0.5, 1.0, '% Crime Reports in 4hrs window')
In [77]:
df_crimecode_grp = df.groupby(by=['Crm Cd','Crm Cd Desc']).count().reset_index().iloc[:,[0,1,2]]
df_crimecode = df_crimecode_grp.rename(columns={df_crimecode_grp.columns[0]: "Crime Code", df_crimecode_grp.columns[1]: "Description",df_crimecode_grp.columns[2]: "Count"})
df_crimecode['%']=100*df_crimecode.Count/df_crimecode.Count.sum()
df_crimecode.sort_values(['%'],inplace=True,ascending=False)
df_crimecode['Cum %']=df_crimecode['%'].cumsum()
#plt.bar(df_crimecode.head(10)['Crime Code'].apply(str),df_crimecode.head(10)['%'],align='center')
plt.barh(df_crimecode.head(10)['Description'],df_crimecode.head(10)['%'],align='center')
plt.xlabel('% Crime Reports')
plt.title('% Top 10 Crime Types')
plt.gca().invert_yaxis()
In [78]:
df_premiscode_grp = df.groupby(by=['Premis Cd','Premis Desc']).count().reset_index().iloc[:,[0,1,2]]
df_premiscode = df_premiscode_grp.rename(columns={df_premiscode_grp.columns[0]: "Premis Code", df_premiscode_grp.columns[1]: "Description",df_premiscode_grp.columns[2]: "Count"})
df_premiscode['%']=100*df_premiscode.Count/df_premiscode.Count.sum()
df_premiscode.sort_values(['%'],inplace=True,ascending=False)
df_premiscode['Cum %']=df_premiscode['%'].cumsum()
#plt.bar(df_crimecode.head(10)['Crime Code'].apply(str),df_crimecode.head(10)['%'],align='center')
plt.barh(df_premiscode.head(10)['Description'],df_premiscode.head(10)['%'],align='center')
plt.gca().invert_yaxis()
plt.xlabel('% Crime Reports')
plt.title('% Top 10 Crime Premises')
Out[78]:
Text(0.5, 1.0, '% Top 10 Crime Premises')

This histogram of victims by age shows that the distribution of ages is right-skewed. The peak is around the earlier noted mean of 37.
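The visual impression of right skew can be quantified with the sample skewness, which is positive for a right tail. A minimal sketch, assuming pandas' `Series.skew` (toy ages below are hypothetical):

```python
import pandas as pd

# Right-skewed toy ages: a long tail of older victims pulls skewness positive.
ages = pd.Series([20, 22, 25, 28, 30, 30, 32, 35, 40, 55, 70, 90])

skew = ages.skew()
print(skew > 0)  # True: the distribution has a right (positive) skew
```

Computing `df['Vict Age'].skew()` on the real column would attach a number to the skew the histogram shows.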

In [79]:
df.hist(column='Vict Age')
plt.xlabel('Age')
plt.title('Victim Counts by Age')
Out[79]:
Text(0.5, 1.0, 'Victim Counts by Age')

LA City Crime Mapped over City Boundaries

This was largely developed using a Geopandas tutorial found on [Towards Data Science](https://towardsdatascience.com/geopandas-101-plot-any-data-with-a-latitude-and-longitude-on-a-map-98e01944b972) and using shape data of the City Boundaries from the LA City GeoHub. These shape files consist of many points and form the background of our map, allowing us to overlay the geolocation data on a map of the LA City Boundaries.

In [80]:
#create a geopandas dataframe and convert lat/long to point geometry:
geometry = [Point(xy) for xy in zip(df["LON"], df["LAT"])]
geometry[:3]
Out[80]:
[<shapely.geometry.point.Point at 0x11fa974a8>,
 <shapely.geometry.point.Point at 0x11dae9588>,
 <shapely.geometry.point.Point at 0x129bf7e48>]
In [81]:
#Tell it that we are using Lat/Long as our coordinates system
crs = {'init': 'epsg:4326'}

geo_df = gpd.GeoDataFrame(df, crs = crs, geometry = geometry)
geo_df.head(2)
/Users/juliacodes/Documents/GitHub/MSDS-ML1-VisualizationAndDataProcessing/lacrime-env/lib/python3.6/site-packages/pyproj/crs/crs.py:53: FutureWarning:

'+init=<authority>:<code>' syntax is deprecated. '<authority>:<code>' is the preferred initialization method. When making the change, be mindful of axis order changes: https://pyproj4.github.io/pyproj/stable/gotchas.html#axis-order-changes-in-proj-6

Out[81]:
DR_NO Date Rptd DATE OCC TIME OCC AREA AREA NAME Rpt Dist No Part 1-2 Crm Cd Crm Cd Desc ... Crm Cd 4 LOCATION Cross Street LAT LON year month DayHr DayType geometry
0 1307355 02/20/2010 12:00:00 AM 02/20/2010 12:00:00 AM 1350 13 Newton 1385 2 900 VIOLATION OF COURT ORDER ... NaN 300 E GAGE AV NaN 33.9825 -118.2695 2010 February 14 12hrs - 16hrs POINT (-118.26950 33.98250)
1 11401303 09/13/2010 12:00:00 AM 09/12/2010 12:00:00 AM 45 14 Pacific 1485 2 740 VANDALISM - FELONY ($400 & OVER, ALL CHURCH VA... ... NaN SEPULVEDA BL MANCHESTER AV 33.9599 -118.3962 2010 September 1 00hrs - 04hrs POINT (-118.39620 33.95990)

2 rows × 33 columns
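The FutureWarning above comes from the deprecated `{'init': 'epsg:4326'}` dict form; newer pyproj and geopandas prefer the authority string directly. A minimal sketch, assuming geopandas and shapely are installed (the two coordinate pairs below are hypothetical):

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

# Two toy points standing in for the real LON/LAT columns (values hypothetical).
df = pd.DataFrame({'LON': [-118.27, -118.40], 'LAT': [33.98, 33.96]})
geometry = [Point(xy) for xy in zip(df['LON'], df['LAT'])]

# Preferred modern syntax: pass the CRS as an authority string, not a dict.
geo_df = gpd.GeoDataFrame(df, crs='EPSG:4326', geometry=geometry)
print(geo_df.crs)
```

Switching the cell above to `crs='EPSG:4326'` would silence the warning without changing the resulting GeoDataFrame.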

In [82]:
#This was just to limit data
#geo_400 = geo_df[geo_df['TIME OCC'] == 400].describe()
#geo_400 = geo_400[geo_df['LAT'] != 0].describe()
geo_df[(geo_df['Vict Age']>=18) & (geo_df['Vict Age']<40)]
Out[82]:
DR_NO Date Rptd DATE OCC TIME OCC AREA AREA NAME Rpt Dist No Part 1-2 Crm Cd Crm Cd Desc ... Crm Cd 4 LOCATION Cross Street LAT LON year month DayHr DayType geometry
1 11401303 09/13/2010 12:00:00 AM 09/12/2010 12:00:00 AM 45 14 Pacific 1485 2 740 VANDALISM - FELONY ($400 & OVER, ALL CHURCH VA... ... NaN SEPULVEDA BL MANCHESTER AV 33.9599 -118.3962 2010 September 1 00hrs - 04hrs POINT (-118.39620 33.95990)
2 70309629 08/09/2010 12:00:00 AM 08/09/2010 12:00:00 AM 1515 13 Newton 1324 2 946 OTHER MISCELLANEOUS CRIME ... NaN 1300 E 21ST ST NaN 34.0224 -118.2524 2010 August 16 12hrs - 16hrs POINT (-118.25240 34.02240)
5 100100506 01/05/2010 12:00:00 AM 01/04/2010 12:00:00 AM 1650 1 Central 162 1 442 SHOPLIFTING - PETTY THEFT ($950 & UNDER) ... NaN 700 W 7TH ST NaN 34.0480 -118.2577 2010 January 17 16hrs - 20hrs POINT (-118.25770 34.04800)
8 100100510 01/09/2010 12:00:00 AM 01/09/2010 12:00:00 AM 230 1 Central 171 1 230 ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT ... NaN 800 W OLYMPIC BL NaN 34.0450 -118.2640 2010 January 3 00hrs - 04hrs POINT (-118.26400 34.04500)
10 100100521 01/14/2010 12:00:00 AM 01/14/2010 12:00:00 AM 1445 1 Central 118 2 624 BATTERY - SIMPLE ASSAULT ... NaN 900 N BROADWAY NaN 34.0640 -118.2375 2010 January 15 12hrs - 16hrs POINT (-118.23750 34.06400)
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2115325 191307168 02/28/2019 12:00:00 AM 02/28/2019 12:00:00 AM 700 13 Newton 1394 1 510 VEHICLE - STOLEN ... NaN 100 E 67TH ST NaN 33.9788 -118.2739 2019 February 8 04hrs - 08hrs POINT (-118.27390 33.97880)
2115326 190906699 02/24/2019 12:00:00 AM 02/23/2019 12:00:00 AM 2220 9 Van Nuys 904 1 210 ROBBERY ... NaN 7600 WILLIS AV NaN 34.2085 -118.4553 2019 February 23 20hrs - 00hrs POINT (-118.45530 34.20850)
2115328 190608903 03/28/2019 12:00:00 AM 03/28/2019 12:00:00 AM 400 6 Hollywood 644 1 648 ARSON ... NaN 1400 N LA BREA AV NaN 34.0962 -118.3490 2019 March 5 04hrs - 08hrs POINT (-118.34900 34.09620)
2115331 191716777 10/17/2019 12:00:00 AM 10/16/2019 12:00:00 AM 1800 17 Devonshire 1795 1 420 THEFT FROM MOTOR VEHICLE - PETTY ($950 & UNDER) ... NaN 17200 NAPA ST NaN 34.2266 -118.5085 2019 October 19 16hrs - 20hrs POINT (-118.50850 34.22660)
2115332 190805435 02/01/2019 12:00:00 AM 02/01/2019 12:00:00 AM 1615 8 West LA 852 1 330 BURGLARY FROM VEHICLE ... NaN 1700 BARRY AV NaN 34.0420 -118.4531 2019 February 17 16hrs - 20hrs POINT (-118.45310 34.04200)

1257768 rows × 33 columns

This shape file from the LA City GeoHub lets us draw a map of LA on which to plot the longitude/latitude coordinates.

In [83]:
#Read in the Shape file to create a general outline of the LA City Boundaries
la_map = gpd.read_file('Data/shape/City_Boundaries.shp')
#fig,ax = plt.subplots(figsize=(15,15))
#la_map.plot(ax=ax)

Looking at the overall boundaries, it is apparent that crime occurs across all age groups, concentrated in the denser city limits toward the lower left of the map.

In [84]:
#Finally we are able to plot the data on the map
fig,ax = plt.subplots(figsize = (15,15))
la_map.plot(ax= ax, alpha = 0.4, color = "grey")
#starting with the largest group and decreasing to limit as much overlay as possible
geo_df[(geo_df['Vict Age']>=18) & (geo_df['Vict Age']<=40)].plot(ax = ax, markersize = 10, color = "red", marker = 'o', label = 'Victim Aged 18 - 40')
geo_df[geo_df['Vict Age']>40].plot(ax = ax, markersize = 10, color = "orange", marker = 'o', label = 'Victim Over 40')
geo_df[geo_df['Vict Age']<18].plot(ax = ax, markersize = 10, color = "blue", marker = 'o', label = 'Victim Under 18')
plt.title('LA Crime By General Age groups \n (Under 18, 18 to 40, and Over 40)')
plt.legend(prop={'size':15})
Out[84]:
<matplotlib.legend.Legend at 0x11d30d6d8>
In [85]:
#Same plot, but looking to see if there are Sex specific areas that appear
fig,ax = plt.subplots(figsize = (15,15))
la_map.plot(ax= ax, alpha = 0.4, color = "grey")
geo_df[(geo_df['Vict Sex']=='M')].plot(ax = ax, markersize = 10, color = "blue", marker = 'o', label = 'Male Victim')
geo_df[geo_df['Vict Sex']=='F'].plot(ax = ax, markersize = 10, color = "red", marker = 'o', label = 'Female Victim')
plt.title('LA Crime \n (Males Vs Females)')
plt.legend(prop={'size':15})
Out[85]:
<matplotlib.legend.Legend at 0x18092e898>

Explore Joint Attributes

[15 points] Visualize relationships between attributes: Look at the attributes via scatter plots, correlation, cross-tabulation, group-wise averages, etc. as appropriate. Explain any interesting relationships.

Ideas

  • (Age, Area)
  • (Age, Crime Code)
  • (Age, Gender)
In [86]:
df_ad_grp = df.groupby(['AREA NAME', 'Vict Descent']).count().reset_index().iloc[:,[0,1,2]]
df_ad = df_ad_grp.rename(columns={df_ad_grp.columns[0]: "Area", df_ad_grp.columns[1]: "Descent", df_ad_grp.columns[2]: "Count"})
fig = px.treemap(df_ad, path=['Area', 'Descent'],values='Count')
fig.update_layout(title="Crime Area & Victim Descent Treemap",width=800, height=500,title_x=0.5)
fig.show()
In [87]:
df_area_yr_grp = df.groupby(by=['AREA NAME','year']).count().reset_index().iloc[:,[0,1,2]]
df_area_yr = df_area_yr_grp.rename(columns={df_area_yr_grp.columns[0]: "Area", df_area_yr_grp.columns[1]: "Year",df_area_yr_grp.columns[2]: "Count"})
df_area_yr_pivot = df_area_yr.pivot(index='Area', columns='Year', values='Count').transpose()

plt.plot( df_area_yr_pivot.index.values,df_area_yr_pivot['77th Street'],marker='o',label= '77th Street')
plt.plot( df_area_yr_pivot.index.values,df_area_yr_pivot['Southwest'],marker='o',label= 'Southwest')
plt.plot( df_area_yr_pivot.index.values,df_area_yr_pivot['N Hollywood'],marker='o',label= 'N Hollywood')
plt.plot( df_area_yr_pivot.index.values,df_area_yr_pivot['Pacific'],marker='o',label= 'Pacific',linewidth=3)
plt.plot( df_area_yr_pivot.index.values,df_area_yr_pivot['Southeast'],marker='o',label= 'Southeast')
plt.plot( df_area_yr_pivot.index.values,df_area_yr_pivot['Mission'],marker='o',label= 'Mission')
plt.plot( df_area_yr_pivot.index.values,df_area_yr_pivot['Northeast'],marker='o',label= 'Northeast')
plt.plot( df_area_yr_pivot.index.values,df_area_yr_pivot['Newton'],marker='o',label= 'Newton')
plt.plot( df_area_yr_pivot.index.values,df_area_yr_pivot['Van Nuys'],marker='o',label= 'Van Nuys')
plt.plot( df_area_yr_pivot.index.values,df_area_yr_pivot['Hollywood'],marker='o',label= 'Hollywood')
plt.plot( df_area_yr_pivot.index.values,df_area_yr_pivot['Central'],marker='o',label= 'Central',linewidth=3, color ='red')
plt.plot( df_area_yr_pivot.index.values,df_area_yr_pivot['Topanga'],marker='o',label= 'Topanga')
plt.plot( df_area_yr_pivot.index.values,df_area_yr_pivot['Devonshire'],marker='o',label= 'Devonshire')
plt.plot( df_area_yr_pivot.index.values,df_area_yr_pivot['Olympic'],marker='o',label= 'Olympic')
plt.plot( df_area_yr_pivot.index.values,df_area_yr_pivot['Harbor'],marker='o',label= 'Harbor')
plt.plot( df_area_yr_pivot.index.values,df_area_yr_pivot['Rampart'],marker='o',label= 'Rampart')
plt.plot( df_area_yr_pivot.index.values,df_area_yr_pivot['West Valley'],marker='o',label= 'West Valley')
plt.plot( df_area_yr_pivot.index.values,df_area_yr_pivot['West LA'],marker='o',label= 'West LA')
plt.plot( df_area_yr_pivot.index.values,df_area_yr_pivot['Wilshire'],marker='o',label= 'Wilshire')
plt.plot( df_area_yr_pivot.index.values,df_area_yr_pivot['Foothill'],marker='o',label= 'Foothill')
plt.plot( df_area_yr_pivot.index.values,df_area_yr_pivot['Hollenbeck'],marker='o',label= 'Hollenbeck')
plt.ylim(5000,20000)
plt.legend(loc='lower center',bbox_to_anchor=(1.2, -0.3))
plt.xlabel('Year')
plt.ylabel('Number of Crime Reports')
plt.title('LA Crime Over 10 Years \n (Different Policing Areas)')
Out[87]:
Text(0.5, 1.0, 'LA Crime Over 10 Years \n (Different Policing Areas)')
In [88]:
df_sex_yr_grp = df.groupby(by=['Vict Sex','year']).count().reset_index().iloc[:,[0,1,2]]
df_sex_yr = df_sex_yr_grp.rename(columns={df_sex_yr_grp.columns[0]: "Sex", df_sex_yr_grp.columns[1]: "Year",df_sex_yr_grp.columns[2]: "Count"})
df_sex_yr_pivot = df_sex_yr.pivot(index='Sex', columns='Year', values='Count').transpose()
plt.plot( df_sex_yr_pivot.index.values,df_sex_yr_pivot['M'],marker='o',label= 'M',linewidth=2,color="blue")
plt.plot( df_sex_yr_pivot.index.values,df_sex_yr_pivot['F'],marker='o',label= 'F',linewidth=2,color="red")
plt.ylim(50000,120000)
plt.legend(loc='lower center',bbox_to_anchor=(1.2,0.3))
plt.xlabel('Year')
plt.ylabel('Number of Crime Reports')
plt.title('LA Crime Over 10 Years \n (Males Vs Females)')
Out[88]:
Text(0.5, 1.0, 'LA Crime Over 10 Years \n (Males Vs Females)')
In [89]:
df_cent_crcd_yr_grp = df.loc[df['AREA NAME'] == 'Central'].groupby(by=['Crm Cd','Crm Cd Desc', 'year']).count().reset_index().iloc[:,[0,1,2,3]]
df_cent_crcd_yr = df_cent_crcd_yr_grp.rename(columns={df_cent_crcd_yr_grp.columns[0]: "CrimeCode", df_cent_crcd_yr_grp.columns[1]: "Description",df_cent_crcd_yr_grp.columns[2]: "Year",df_cent_crcd_yr_grp.columns[3]: "Count"})
df_cent_crcd_yr_pivot = df_cent_crcd_yr.pivot(index='Description', columns='Year', values='Count').transpose()
for (columnName, columnData) in df_cent_crcd_yr_pivot.iteritems():
    incThresh = float(columnData.values[9])/float(columnData.values[4])
    if (incThresh > 10):
        plt.plot( df_cent_crcd_yr_pivot.index.values,columnData.values/columnData.values[4],marker='8',label= [str(columnName)],linewidth=4,color='purple')
    if (incThresh > 5 and (incThresh <= 10)):
        plt.plot( df_cent_crcd_yr_pivot.index.values,columnData.values/columnData.values[4],marker='s',label= [str(columnName)],linewidth=3,color='red')
    if ((incThresh > 4) and (incThresh <= 5)):
        plt.plot( df_cent_crcd_yr_pivot.index.values,columnData.values/columnData.values[4],marker='o',label= [str(columnName)],linewidth=2,linestyle='--',color='blue')
    if ((incThresh > 3) and (incThresh <= 4)):
        plt.plot( df_cent_crcd_yr_pivot.index.values,columnData.values/columnData.values[4],marker='*',label= [str(columnName)],linewidth=1,linestyle='-.',color='black')
    if ((incThresh > 2) and (incThresh <= 3)):
        plt.plot( df_cent_crcd_yr_pivot.index.values,columnData.values/columnData.values[4],marker='d',label= [str(columnName)],linewidth=0.5,linestyle=':',color='grey')
            
plt.ylim(0.1,100)
plt.yscale('log')
plt.legend(loc='right',bbox_to_anchor=(2.2,0.5))
plt.xlabel('Year')
plt.ylabel('Crime reports normalized to yr 2014')
plt.title('Increasing Crime Types in Central Area')
Out[89]:
Text(0.5, 1.0, 'Increasing Crime Types in Central Area')
In [90]:
df_cent_sex_yr_grp = df.loc[df['AREA NAME'] == 'Central'].groupby(by=['Vict Sex','year']).count().reset_index().iloc[:,[0,1,2]]
df_cent_sex_yr = df_cent_sex_yr_grp.rename(columns={df_cent_sex_yr_grp.columns[0]: "Sex", df_cent_sex_yr_grp.columns[1]: "Year",df_cent_sex_yr_grp.columns[2]: "Count"})
df_cent_sex_yr_pivot = df_cent_sex_yr.pivot(index='Sex', columns='Year', values='Count').transpose()
#df_cent_sex_yr_pivot
plt.plot( df_cent_sex_yr_pivot.index.values,df_cent_sex_yr_pivot['M'],marker='o',label= 'M',linewidth=2,color="blue")
plt.plot( df_cent_sex_yr_pivot.index.values,df_cent_sex_yr_pivot['F'],marker='o',label= 'F',linewidth=2,color="red")
#plt.ylim(50000,120000)
plt.legend(loc='lower center',bbox_to_anchor=(1.2,0.3))
plt.xlabel('Year')
plt.ylabel('Number of Crime Reports')
plt.title('LA Crime In Central Area \n (Males Vs Females)')
Out[90]:
Text(0.5, 1.0, 'LA Crime In Central Area \n (Males Vs Females)')
In [91]:
df_cent_des_yr_grp = df.loc[df['AREA NAME'] == 'Central'].groupby(by=['Vict Descent','year']).count().reset_index().iloc[:,[0,1,2]]
df_cent_des_yr = df_cent_des_yr_grp.rename(columns={df_cent_des_yr_grp.columns[0]: "Descent", df_cent_des_yr_grp.columns[1]: "Year",df_cent_des_yr_grp.columns[2]: "Count"})
df_cent_des_yr_pivot = df_cent_des_yr.pivot(index='Descent', columns='Year', values='Count').transpose()
#df_cent_des_yr_pivot
for (columnName, columnData) in df_cent_des_yr_pivot.items():  # .iteritems() was removed in pandas 2.x
        plt.plot( df_cent_des_yr_pivot.index.values,columnData.values/columnData.values[4],marker='o',label= [str(columnName)],linewidth=2)
plt.ylim(0.1,50)
plt.yscale('log')
plt.legend(loc='right',bbox_to_anchor=(1.2,0.5))
plt.xlabel('Year')
plt.ylabel('Crime reports normalized to yr 2014')
plt.title('Increasing Crime Types \n Central Area by Victim Descent')
Out[91]:
Text(0.5, 1.0, 'Increasing Crime Types \n Central Area by Victim Descent')
In [92]:
df_yr_month_grp = df.groupby(['year', 'month']).count().reset_index().iloc[:,[0,1,2]]
df_yr_month = df_yr_month_grp.rename(columns={df_yr_month_grp.columns[0]: "Year", df_yr_month_grp.columns[1]: "Month", df_yr_month_grp.columns[2]: "Count"})

#sns.heatmap(df_yr_month.pivot(index='Year', columns='Month', values='Count'),  linewidths=.5)

Explore Attributes and Class

[10 points] Identify and explain interesting relationships between features and the class you are trying to predict (i.e., relationships with variables and the target classification).

The interesting variables identified so far are:

  • Victim Age
  • Latitude/Longitude
  • Area

New Features

[5 points] Are there other features that could be added to the data or created from existing features? Which ones?

In the future we would like to derive more data from the reported and occurrence dates. Subtracting the occurrence date from the reported date would show how long it took to report the crime. We could also use these dates in conjunction with external APIs to determine sunset time and estimate whether a crime occurred during the day or at night. The open Sunrise-Sunset API (https://sunrise-sunset.org/api) also accepts latitude and longitude, so it would return precise times for that geolocation.
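A minimal sketch of the date-difference feature, on a toy frame. The column names 'Date Rptd' and 'DATE OCC' and the MM/DD/YYYY format are assumptions based on the dataset's schema; the sunrise/sunset lookup would require the API call and is not shown.

```python
import pandas as pd

# Toy stand-in for the crime dataframe; real data has the same columns.
toy = pd.DataFrame({
    'DATE OCC':  ['01/05/2014', '06/20/2015'],
    'Date Rptd': ['01/07/2014', '06/20/2015'],
})
occ = pd.to_datetime(toy['DATE OCC'], format='%m/%d/%Y')
rptd = pd.to_datetime(toy['Date Rptd'], format='%m/%d/%Y')
# Days elapsed between occurrence and report
toy['days_to_report'] = (rptd - occ).dt.days
print(toy['days_to_report'].tolist())  # [2, 0]
```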

Another API could be used to check whether a crime was committed on a holiday. To get the most commonly observed holidays, we could use Calendarific (https://calendarific.com/api-documentation), which offers the option to filter by type of holiday and limit results to national holidays.
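As a local stand-in for the Calendarific lookup, pandas' built-in US federal holiday calendar can flag holiday dates; a production version would call the API instead.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# Federal holidays (observed dates) over the dataset's span
holidays = USFederalHolidayCalendar().holidays(start='2010-01-01', end='2019-12-31')

dates = pd.to_datetime(pd.Series(['2014-07-04', '2014-07-05']))
is_holiday = dates.isin(holidays)  # Independence Day vs. the day after
print(is_holiday.tolist())  # [True, False]
```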

Some fields could simply be transformed into new ones. We could denote crime occurrences by the day of the week they occurred and whether they fell on a weekend. Rather than working with victim age as a continuous variable, it may be useful to group victims into demographic age brackets. Weapons not mentioned as present are represented as null in the Weapons Used Code field; this could become a single Weapon Used flag that resolves to true/false. Similarly, checking the Crime Code 1-4 fields for nulls would yield a charge count field.
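The derived fields above can be sketched on a toy frame. The column names ('DATE OCC', 'Vict Age', 'Weapon Used Cd', 'Crm Cd 1'..'Crm Cd 4') follow the dataset's schema; the age bracket edges are an illustrative assumption.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'DATE OCC': pd.to_datetime(['2014-07-05', '2014-07-07']),
    'Vict Age': [23, 67],
    'Weapon Used Cd': [400, np.nan],
    'Crm Cd 1': [624, 510], 'Crm Cd 2': [998, np.nan],
    'Crm Cd 3': [np.nan, np.nan], 'Crm Cd 4': [np.nan, np.nan],
})
# Day of week and weekend flag
toy['day_of_week'] = toy['DATE OCC'].dt.day_name()
toy['is_weekend'] = toy['DATE OCC'].dt.dayofweek >= 5
# Continuous age -> demographic brackets (edges are assumptions)
toy['age_group'] = pd.cut(toy['Vict Age'], bins=[0, 17, 34, 54, 120],
                          labels=['minor', 'young adult', 'adult', 'senior'])
# Null weapon code -> boolean weapon flag
toy['weapon_used'] = toy['Weapon Used Cd'].notna()
# Non-null crime codes -> charge count
toy['crime_count'] = toy[['Crm Cd 1', 'Crm Cd 2',
                          'Crm Cd 3', 'Crm Cd 4']].notna().sum(axis=1)
print(toy[['day_of_week', 'is_weekend', 'age_group',
           'weapon_used', 'crime_count']])
```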

Making use of the MOCodes (Modus Operandi Codes) will take some processing: splitting them into lists and aggregating across the dataset. We could then tally the most frequently used codes and dummy-code only those as new columns that an incident either has or does not have, rather than adding potentially hundreds of new fields to each row.
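A sketch of that split-tally-dummy-code pipeline, assuming the MOCodes field is a space-separated string of codes (as in the dataset) and keeping only a small top-N:

```python
import pandas as pd

# Toy stand-in for the 'Mocodes' column
mocodes = pd.Series(['0416 0344', '0344 1309', '0344'])

code_lists = mocodes.str.split()                 # each row -> list of codes
counts = code_lists.explode().value_counts()     # frequency across dataset
top_codes = counts.head(2).index                 # dummy-code only the top-N

dummies = pd.DataFrame({
    f'MO_{code}': code_lists.apply(lambda codes, c=code: c in codes)
    for code in top_codes
})
print(dummies)
```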

The precincts are given and can be joined to other precinct datasets, which include the total area of each precinct (https://geohub.lacity.org/datasets/lapd-reporting-districts). We could also find how many cars are assigned to each precinct in the Basic Car Plan (http://www.lapdonline.org/search_results/content_basic_view/6528).

Exceptional Work

[10 points] You have free rein to provide additional analyses. One idea: implement dimensionality reduction, then visualize and interpret the results.
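A minimal sketch of the suggested dimensionality-reduction idea: PCA via SVD on standardized synthetic stand-ins for three numeric features (victim age, LAT, LON). The value ranges are rough assumptions, not the real data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(10, 90, 500).astype(float),  # victim age
    rng.uniform(33.7, 34.3, 500),             # LAT
    rng.uniform(-118.7, -118.1, 500),         # LON
])
# Standardize so no feature dominates the components
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
# PCA: right singular vectors are the principal directions
U, S, Vt = np.linalg.svd(Xs, full_matrices=False)
X_2d = Xs @ Vt[:2].T                          # project onto top-2 components
explained = (S**2 / (S**2).sum())[:2].sum()   # variance captured by top-2
print(X_2d.shape, round(explained, 3))
```

The projected `X_2d` could then be fed to `plt.scatter`, colored by a class of interest, to see whether the reduced space separates it.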
